1. Introduction


People tracking is a topic that has attracted the interest of researchers in several fields such as ambient intelligent systems 
[12,29,49,58], visual servoing [3,43], human-computer interaction [10,25,68], video compression [45,67] or robotics [6,50]. 
Although most of the research has focused on single camera approaches, this configuration is not the ideal solution for multi- 
people tracking due to the occlusion problem and because the space covered by a single camera might be too small for cer- 
tain applications. Multiple camera tracking approaches aim to solve these problems. 

Among the numerous approaches proposed for tracking, Bayesian filtering is the most frequently employed. In particular, 
particle filters [24,30,31,38] have gained popularity in the vision community because of their advantages over the Kalman 
filter [26]. First, particle filters are able to carry multiple hypotheses simultaneously. And second, they can deal with non- 
linear and non-Gaussian systems. Nevertheless, the main obstacle to applying Bayesian filters is the effort some problems 
require for finding precise probabilistic models that describe the process or the sensors employed. In many real scenarios, 
it may be difficult to obtain complete knowledge of the problem due to high occlusion, background clutter, illumination 
or camera calibration errors. In order to fuse data handling uncertainty, the Bayesian theory [55] requires prior probability, 
likelihoods, and posterior probabilities to be defined. Without precise information, these three elements might not be defined
properly, thus leading to assumptions and restrictions of the problem. Furthermore, real tracking problems often require a large
number of cameras, so it is likely that a target is not visible in some of them. In
that case, proposing an observation model for a particle filter is complicated: which likelihood (p(z|x)) must be assigned
to a state that cannot be observed? How do we fuse information from multiple cameras using a Bayesian approach if one of
them does not observe the target?

An alternative for dealing with the above-mentioned difficulties is the Dempster-Shafer (DS) theory of evidence [2,5,8,17,27,57].
DS theory is able to model systems assuming that the knowledge about the problem is not completely precise, thus allowing
the natural manipulation of ignorance and uncertainty. The questions formulated above can be answered by the DS theory
using the “unknown” subset, which represents the absence of knowledge about a state. This work proposes a reformulation of
the classical particle filtering algorithm on the basis of the DS theory of evidence. The proposed evidential filter is specifically
designed for fusing data from multiple sensors and is applied in this work to solve the multiple camera people tracking
problem.


1.1. Related work 


The multi-camera people tracking problem has been addressed from different perspectives. In [34], Kang et al. propose a 
method for tracking multiple objects employing a homography that registers the cameras on top of a known ground plane. 
Multi-camera tracking is formulated as the maximisation of a joint probability model based on the colour of the blobs de- 
tected after background subtraction. The motion of the models is estimated using Kalman filters and the 3D positions of the 
objects are obtained from the observation of the positions of people’s feet. Data association is carried out using the Joint 
Probabilistic Data Association Filter (JPDAF). In [37], Kim and Davis propose a method for tracking people in multiple-views 
using a particle filtering approach. After background subtraction, the foreground pixels are classified into blobs that are as- 
signed to the people being tracked. The information from the multiple cameras is then combined to determine the ground 
plane position. To do so, the centre vertical axes of each person across views are mapped to the ground plane and their intersection
point on the ground is estimated. The method requires people’s feet to be visible and the ground plane homography
to be known. Similarly, the work in [4] employs the floor homography for tracking the 3D positions of objects using a Kalman
filter. Ref. [35] also uses the ground plane homography to track people using a look-ahead technique that combines infor- 
mation from multiple frames in order to detect people’s paths. 

The works reviewed above require the whole silhouette of the people being tracked to be visible. For this purpose, the cameras
must be placed at elevated positions and relatively far from the people. Although this restriction could be feasible in
outdoor scenarios, it might be impossible in indoor scenarios where the areas to be covered are small and cameras must 
be placed nearer to the people. A solution to the tracking problem in indoor environments can be found in Ref. [21]. In that 
work, Fleuret et al. present a tracking approach using multiple cameras placed at eye level. They employ a generative model 
to determine the ground locations of people at each frame. For that purpose, the monitored area is discretized into cells to 
create a probabilistic occupancy map. At each frame, they employ an iterative process in order to determine the locations of 
the people. Although the authors claim that the computing time is improved by the use of integral images, their applicability 
imposes restrictions on the camera positions, i.e., the cameras must be placed in such a manner as to prevent people from 
appearing to be inclined in the images. Furthermore, the complexity of their approach grows exponentially with the size of 
the area monitored. In [22], the authors describe a distributed tracking system using multiple cameras. At each frame, inde- 
pendent blobs are detected at each camera and passed to a centralised tracker that estimates the 3D people locations. They 
test both a best hypothesis heuristic tracking approach and a probabilistic multi-hypothesis tracker, reporting similar per- 
formance for both methods. 

Other authors have employed stereo information in order to enhance tracking. The authors of this work have proposed 
several approaches for people detection and tracking using a single stereo camera. While Ref. [51] proposes a tracking ap- 
proach combining colour and stereo extracted directly from camera image, the work in [52] proposes the use of plan-view 
maps to represent stereo information more efficiently. However, using a single stereo camera still limits the area of surveillance.
Therefore, some authors have proposed tracking approaches using multiple stereo cameras. In [47], Mittal and Davis
present a probabilistic approach for tracking people in cluttered scenes using multiple monocular cameras. They employ 
a fine camera calibration and compute epipolar lines for each camera pair. People are defined using a cylindrical colour mod- 
el that registers the colour of their clothes. After background subtraction, the foreground pixels are assigned to the people 
being tracked. The stereo information of the people being tracked is then extracted by matching foreground segments across 
adjacent cameras. This information is then projected onto a ground map to detect the positions of the people being tracked. 
The main drawback of their approach is that their algorithm requires several iterations per frame in order to achieve con- 
vergence. In [40], Krumm et al. show a people tracking system for a smart room. In their work, a pair of stereo cameras with 
a short base line is employed to monitor the area of interest. The extrinsic camera parameters are roughly estimated by 
matching the paths of people walking in the room in an initial stage. People are detected by grouping 3D blobs extracted 
from the stereo information. Tracking is performed on ground plane coordinates merging past observations and colour infor- 
mation. Three-dimensional information is also employed in [70] for localising people. In that case, the volumetric informa- 
tion is obtained with a standard visual hull procedure. 

An important aspect of the works reviewed above is that tracking is performed only in the area where the cameras’ fields of
view (fov) intersect, i.e., the area visible to all cameras. This is a limitation for many applications, especially when the number
of cameras grows in order to cover large areas. In that case, there might be a group of cameras with overlapping fovs, but not
all of them might share a common visible area. Applying a particle filter to fuse information in that case is not straightforward.
Each particle could be evaluated independently in each camera and the evidence then fused using a joint approach.
However, which likelihood must be assigned to a particle that cannot even be observed? We propose a novel evidential filter
based on the Dempster-Shafer theory of evidence to solve that problem.

Previous works have proposed the use of evidential filters as an alternative to traditional Bayesian filters. In Ref. [32],
Kagiwada and Kalaba formulate a theoretical non-linear evidential filter based on dynamic programming and fuzzy sets.
Fuzzy sets are employed to compute a degree of belief about the presence of the target in each of the cells into which the
space is divided. The belief of a cell is taken to be the maximum one provided by all sensors, thus discarding possibly
important information provided by the rest of sensors. Another problem of their approach is that a discretization of 
the space is required. Moreover, the proposal requires that each cell be estimated, thereby reducing the scalability of 
the method. In Ref. [42], Mahler designs an evidential filter using the Dempster-Shafer (DS) theory of evidence [57]. This 
filter is an extension of the Kalman filter which is applicable when the measurement of uncertainty is modelled in the DS 
domain. However, it can only be employed to model linear systems with Gaussian noise. In Ref. [63], Smets and Ristic 
develop a novel solution to the tracking and classification problem using the Transferable Belief Model (TBM) as an 
extension to traditional approaches based on the Kalman filter. In Ref. [64], the authors propose an evidential filter 
using a particle filtering approach where observations are measured using the Dezert-Smarandache Theory (DSmT). 
The DSmT is an extension of the DS theory for modelling the paradoxical interpretation of conflicting sources of informa- 
tion. In their work, DSmT is employed to fuse the colour and position features while tracking two people using a single 
camera. In their approach, the degrees of evidence for all targets are calculated at each particle, i.e., a joint configuration is
employed. Thus, the main problem of their approach is that the complexity of their filter grows exponentially with the
number of targets. Furthermore, they do not deal with the problem of fusing information in case of partial or total 
occlusion of the targets. Another interesting piece of work related to ours is Ref. [39]. The authors employ a traditional
particle filter scheme for tracking vehicles on roads. To do so, several features are extracted from image patches and combined
using the TBM. Then, the fused evidence at each particle is transformed into probabilities using the pignistic
transform. The main drawback of their method is that it is specifically designed for the car tracking problem and for a 
single target. 


1.2. Proposed contribution 


This work proposes a novel evidential particle filter for tracking multiple targets using a set of heterogeneous and possibly 
unreliable sensors. The problem is formulated in terms of the DS theory of evidence. The DS theory is a generalisation of the 
Bayes theory of subjective probability for the mathematical representation of uncertainty. It has been applied to several dis- 
ciplines such as fraud detection [54], classification [16], risk analysis [13], clustering [15,44], image processing [7,5,28,56], 
autonomous robot mapping [72], human-computer interaction [69], land mine detection [46] and diagnosis [71], amongst 
others. 

The proposed algorithm models the possible states of the dynamic system being tracked as a set of particles. For each
particle, the sensors estimate a degree of evidence that the particle represents the true target state. The sensors also provide
a degree of evidence about their own reliability. In a final data fusion step, data collected from all the sensors are fused to
provide the best location hypothesis taking uncertainty into account. The modelling of uncertainty and absence of knowl- 
edge of our approach is especially attractive since it does not require specifying priors or conditionals that might be difficult 
to obtain in complex problems. Since joint particle filters suffer from the curse of dimensionality [66], our algorithm employs 
a multiple particle filtering approach [18,19,36,51,53,65], i.e., an independent particle filter is employed for each target and 
possible interactions are considered. 

The proposed tracking algorithm, called the Multiple Evidential Particle Filter (MEPF), is a general tracking algorithm 
for multiple targets using multiple sensors. This algorithm is employed to provide a novel solution to the multi-camera 
people tracking problem. For each camera, our approach computes a degree of evidence about the possibility of finding 
the tracked person at the particle location. For that purpose, a generative-based approach that analyses the projection 
of a 3D person model in the camera images is employed. Foreground, colour and shape information are used to compute 
a degree of evidence for each camera. Using a depth-ordering scheme, occlusion is calculated separately in each camera. 
Occlusion is treated by our algorithm as the absence of knowledge about the locations of the people being tracked. In the 
data fusion step, the evidence collected from all the cameras is fused in order to obtain the best estimation of the target 
location. Information from unreliable cameras (those with high occlusion or that only partially see the target) is weakly 
considered. 

This paper makes two main contributions. First, an evidential filtering algorithm is proposed for tracking 
multiple targets by fusing information from multiple unreliable sensors. Since independent trackers are employed instead 
of a joint configuration, the complexity of the algorithm grows linearly with the number of targets instead of 
exponentially as in [64]. Moreover, it does not require assuming the Gaussian and linear conditions imposed by the 
Kalman filter [42]. Second, a novel solution to the multi-camera people tracking problem in indoor environments is proposed.
Our approach does not require the whole silhouette of the targets to be visible; rather, it reasons about
the visible portion of the targets while considering the uncertainty associated with the lack of visibility and
occlusion. Additionally, it is not necessary to explicitly compute stereo information as in Refs. [47,40,70]. Instead, the
locations of people are estimated by intersecting evidence collected from multiple cameras. Furthermore, the proposed 
approach does not require a discretization of the space as in [21,32], meaning that the scalability of the algorithm is 
better. 

The remainder of this paper is structured as follows. Section 2 explains the basis of the DS theory of evidence,
and Section 3 presents the proposed MEPF algorithm. Section 4 describes the proposed multiple camera people
tracking solution using the MEPF algorithm. Finally, the experiment is reported in Section 5 and conclusions are drawn in
Section 6.


2. Dempster-Shafer theory of evidence 


The DS theory, which is also known as the evidence theory, is a generalisation of the Bayes theory of subjective probability.
It includes several models of reasoning under uncertainty, such as Smets’ Transferable Belief Model (TBM) [59].
The DS approach employs degrees of evidence, which are a weaker version of probabilities. The management of uncertainty
in the DS theory is especially attractive because of its simplicity and because it does not require specifying priors or conditionals
that might be infeasible to obtain in certain problems. In the DS domain, it is possible to assign a degree of ignorance to
an event instead of being forced to supply prior probabilities adding to unity.

Let us consider a variable $\omega$ taking values in the frame of discernment $\Omega$, and let us denote the set of all its possible
subsets by $2^{\Omega}$ (also called the power set). A basic belief assignment (bba)

$$m : 2^{\Omega} \rightarrow [0,1]$$

is a function that assigns masses of belief to the subsets $A$ of the power set, verifying:

$$\sum_{A \subseteq \Omega} m(A) = 1. \qquad (1)$$



While the evidence assigned to an event in the Bayesian approach must be a probability distribution function, the mass $m(A)$
of a power set element can be a subjective function expressing how much evidence supports the fact $A$. Furthermore, complete
ignorance about the problem can be represented by $m(\Omega) = 1$.

Shafer’s original model imposes the condition $m(\emptyset) = 0$ in addition to that expressed in Eq. (1), i.e., the empty subset
should not have mass of belief. However, Smets’ TBM relaxes that condition so that $m(\emptyset) > 0$ stands for the possibility
of incompleteness and conflict (see Ref. [60]). In the first case, $m(\emptyset)$ is interpreted as the belief that something outside $\Omega$ happens,
i.e., accepting the open-world assumption. In the second case, the mass of the empty set can be seen as a measure of
conflict arising when merging information from sources pointing in different directions.

Nonetheless, a renormalisation can transform a Smets’ bba $m$ into a Dempster’s bba $m^*$ as:

$$m^*(A) = \frac{m(A)}{1 - m(\emptyset)} \quad \text{if } A \neq \emptyset. \qquad (2)$$


One of the most attractive features of DS theory is the set of methods available to fuse information from several sources.
Let us consider two bbas $m_1$ and $m_2$ representing distinct pieces of evidence. The standard way of combining them is
the conjunctive sum operation [61], defined as:

$$(m_1 \mathbin{\text{\textcircled{$\scriptstyle\cap$}}} m_2)(A) = \sum_{B \cap C = A} m_1(B)\, m_2(C), \quad \forall A \subseteq \Omega. \qquad (3)$$

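Dempster’s rule of combination can be derived from Eq. (3) by imposing normality (i.e., $m(\emptyset) = 0$) as:

$$(m_1 \oplus m_2)(A) = \frac{1}{1 - K} \sum_{B \cap C = A} m_1(B)\, m_2(C), \quad \forall A \subseteq \Omega,\ A \neq \emptyset, \qquad (4)$$

with

$$K = (m_1 \mathbin{\text{\textcircled{$\scriptstyle\cap$}}} m_2)(\emptyset). \qquad (5)$$
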
The above rules assume that the sources manage independent pieces of information. However, if information is corre- 
lated, the cautious rule should be employed [14]. 
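
The two combination rules of Eqs. (3) and (4) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the dictionary representation, the function names and the mass values are assumptions made here, and the frame is taken to be {present, not_present} so that "unknown" is the whole frame.

```python
from itertools import product

# Assumed representation: a bba is a dict mapping frozensets to masses.
PRESENT = frozenset({"present"})
ABSENT = frozenset({"not_present"})
UNKNOWN = PRESENT | ABSENT   # the whole frame, i.e. total ignorance
EMPTY = frozenset()


def conjunctive(m1, m2):
    """Unnormalised conjunctive sum, Eq. (3); mass may land on the empty set."""
    out = {}
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        a = b & c  # set intersection B ∩ C
        out[a] = out.get(a, 0.0) + mass_b * mass_c
    return out


def dempster(m1, m2):
    """Dempster's rule, Eq. (4): remove the conflict K and renormalise."""
    joint = conjunctive(m1, m2)
    k = joint.pop(EMPTY, 0.0)  # conflict mass K of Eq. (5)
    if k >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {a: mass / (1.0 - k) for a, mass in joint.items()}


# Hypothetical masses reported by two sensors for the same particle.
m1 = {PRESENT: 0.6, ABSENT: 0.1, UNKNOWN: 0.3}
m2 = {PRESENT: 0.5, ABSENT: 0.2, UNKNOWN: 0.3}
joint = conjunctive(m1, m2)   # keeps the conflict on the empty set
fused = dempster(m1, m2)      # conflict removed, masses sum to 1
```

With these values the conflict is K = 0.6·0.2 + 0.1·0.5 = 0.17, which the conjunctive sum keeps on the empty set and Dempster's rule redistributes by dividing the remaining masses by 1 − K.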


In some applications it is necessary to make a decision and choose the most reliable single hypothesis $\omega$. To do so, Smets
and Kennes [62] proposed the use of the pignistic transformation, which is defined for a normal bba as:

$$BetP(\omega) = \sum_{A \subseteq \Omega,\ \omega \in A} \frac{m(A)}{|A|}, \qquad (6)$$

where $|A|$ denotes the cardinality of $A$.
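
As a small illustration of Eq. (6), the following sketch computes the pignistic probability for a two-element frame. The dictionary representation, the names and the mass values are assumptions chosen for the example; the bba is assumed to be normal (no mass on the empty set).

```python
def pignistic(m, omega):
    """BetP(omega), Eq. (6): each set shares its mass equally among its elements."""
    return sum(mass / len(a) for a, mass in m.items() if omega in a)


PRESENT = frozenset({"present"})
ABSENT = frozenset({"not_present"})
UNKNOWN = PRESENT | ABSENT  # the whole frame, i.e. total ignorance

# Hypothetical normal bba.
m = {PRESENT: 0.6, ABSENT: 0.1, UNKNOWN: 0.3}
bet_present = pignistic(m, "present")       # 0.6 + 0.3 / 2
bet_absent = pignistic(m, "not_present")    # 0.1 + 0.3 / 2
```

The mass on the "unknown" subset is split evenly between the two hypotheses, so the two pignistic probabilities add up to one.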

MEPF Iteration
From the sample set $C(t-1)$ construct a new sample set $C(t)$ as:

(i) Resampling and propagation:
    - Calculate the normalised cumulative relevance $cr^i_{t-1}$ of each sample as

      $$cr^i_{t-1} = \begin{cases} \dfrac{r^i_{t-1}}{sum(r_{t-1})} & \text{if } i = 1; \\[2ex] cr^{i-1}_{t-1} + \dfrac{r^i_{t-1}}{sum(r_{t-1})} & \text{otherwise}, \end{cases} \qquad (7)$$

      where $sum(r_{t-1}) = \sum_{j=1}^{N} r^j_{t-1}$.
    - Select the new particles from the set $C(t-1)$ repeating for $i = 1 \ldots N$:
        - Generate a random number $r \sim U(0,1)$ and by binary subdivision find the smallest $j$ for which $cr^j_{t-1} > r$.
        - Set $c^i_t = c^j_{t-1}$ and propagate the particle state according to the target’s movement model:

          $$\mathbf{x}^i_t = A\,\mathbf{x}^i_{t-1} + w(t-1) \qquad (8)$$

(ii) Belief Estimation: For each particle $i$ and each sensor $v$, calculate the masses $M^v(\mathbf{x}^i_t) \in M^i_t$.
(iii) Data fusion: Fuse the data of each particle obtaining $\mathcal{M}^i_t$ and calculate its relevance taking into account interactions with other
targets as:

      $$r^i_t = \mathcal{R}(\mathcal{M}^i_t) * \mathcal{I}^i_t. \qquad (9)$$

(iv) State Estimation: Obtain the best state estimation $E[C(t)] = \mathbf{x}^b_t$, where $b = \arg\max_i (r^i_t)$.

Fig. 1. MEPF algorithm.


3. Multiple evidential particle filtering


Building on the DS theory and inspired by particle-based algorithms, this section explains the Multiple Evidential Particle
Filtering (MEPF) algorithm proposed in this work. The goal of tracking is to estimate the state of a dynamic system. The
system might be comprised of a set of $n$ subsystems, each of which has its own dynamics, such that

$$X_t = \{\mathbf{x}^1_t, \ldots, \mathbf{x}^n_t\}.$$


The underlying idea of our algorithm is similar to that employed in particle filtering approaches. The true target state is esti- 
mated from a set of possible states (called particles). The main difference with regard to particle filtering approaches is that 
our proposal does not evaluate the likelihood of particles (in the Bayesian sense), but their degree of evidence (in the DS 
sense). The algorithm is specifically conceived to simultaneously deal with multiple sensors. Hence, the evidence of particles 
is evaluated using all the available sensors and finally fused. Let us denote the total number of available sensors by V. 

To avoid the curse of dimensionality that arises when a joint state configuration is employed, a separate tracker is em- 
ployed for each target. Nevertheless, target interactions are considered by using an interaction factor to maintain multi- 
modality and avoid the coalescence problem, as explained below. Each independent tracker keeps a set of N particles. For each 
particle, each sensor is asked: Is the target at the particle location? Using symbols, we define the facts to be evaluated for a
sensor at each particle as:

$$\mathcal{S} = \{present, \neg present\},$$

which define the power set:

$$P(\mathcal{S}) = \{\emptyset, \{present\}, \{\neg present\}, \{unknown\}\}.$$

Then, for each type of sensor, a bba must be defined for the elements of $P(\mathcal{S})$. By

$$M^v(\mathbf{x}) = \{m^v(present), m^v(\neg present), m^v(unknown)\},$$

we shall denote the bba provided by the $v$th sensor about the subsets in the power set. Mass $m^v(present)$ represents the degree
of evidence assigned by the $v$th sensor to the fact that the target is at $\mathbf{x}$. On the other hand, mass $m^v(\neg present)$ represents
the evidence that the target is not at $\mathbf{x}$. Finally, $m^v(unknown)$ represents the degree of evidence of the sensor itself, i.e.,
high values of $m^v(unknown)$ denote that the sensor is not reliable for that particle.

On the basis of the power set just defined, let us denote the set of particles of each tracker by:

$$C(t) = \{c^i_t = (\mathbf{x}^i_t, M^i_t, \mathcal{M}^i_t, r^i_t) \mid i = 1, \ldots, N\}. \qquad (10)$$

The parameter

$$M^i_t = \{M^v(\mathbf{x}^i_t) \mid v = 1, \ldots, V\}$$

represents all the bbas provided by the $V$ sensors about the state $\mathbf{x}^i_t$, and

$$\mathcal{M}^i_t = \mathcal{M}(\mathbf{x}^i_t) \qquad (11)$$

represents the bba resulting from fusing the evidence in $M^i_t$ using the most appropriate combination rule. Its selection depends
on the nature of the data manipulated. If the pieces of evidence to be fused are uncorrelated, then either the conjunctive
sum operation (Eq. (3)) or Dempster’s rule of combination (Eq. (4)) might be employed. However, if they are
correlated, the cautious rule should be employed [14].
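
To illustrate why a sensor that cannot observe a particle does not disturb the fusion, consider a worked instance of Eq. (3). The mass values are hypothetical, chosen only for illustration, and the usual DS reading is assumed in which unknown stands for the whole frame {present, ¬present}. Suppose sensor 1 reports $m^1(present) = 0.6$, $m^1(\neg present) = 0.1$ and $m^1(unknown) = 0.3$, while sensor 2 is fully occluded and therefore reports $m^2(unknown) = 1$:

```latex
\begin{align*}
(m^1 \mathbin{\text{\textcircled{$\scriptstyle\cap$}}} m^2)(present)
  &= m^1(present)\, m^2(unknown) = 0.6 \cdot 1 = 0.6, \\
(m^1 \mathbin{\text{\textcircled{$\scriptstyle\cap$}}} m^2)(\neg present)
  &= m^1(\neg present)\, m^2(unknown) = 0.1 \cdot 1 = 0.1, \\
(m^1 \mathbin{\text{\textcircled{$\scriptstyle\cap$}}} m^2)(unknown)
  &= m^1(unknown)\, m^2(unknown) = 0.3 \cdot 1 = 0.3, \\
(m^1 \mathbin{\text{\textcircled{$\scriptstyle\cap$}}} m^2)(\emptyset)
  &= 0.
\end{align*}
```

The fused bba equals that of sensor 1: a bba with all its mass on unknown acts as a neutral element of the conjunctive sum, so no likelihood has to be invented for a state that a sensor cannot observe.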

The relevance of a particle $r^i_t$ is a single value indicating the degree to which the particle represents the true target
state. It is computed using the mapping function $\mathcal{R}$ and an interaction factor $\mathcal{I}^i_t$. The function

$$\mathcal{R}(\mathcal{M}) = BetP(present) \qquad (12)$$

is calculated as the pignistic probability (Eq. (6)) of the present event in $\mathcal{M}$. The interaction factor $\mathcal{I}^i_t$ models target interactions
in order to maintain multi-modality and avoid the coalescence problem [9,36,51,65]. The coalescence problem occurs
when two (or more) targets with similar characteristics are close to each other. In that case, the target that obtains a higher 
relevance might “hijack” the particles of the rest of the trackers. Imagine for example the problem of tracking people based 
on the colour of their clothes. In this case, two people wearing the same clothes might be indistinguishable when they come 
close to each other. If one of them is severely occluded in all the cameras, the particles of that person’s tracker will move towards
the position of the visible target. The interaction factor is defined such that it tends to 0 when a particle
is near the position of another target and tends to 1 when the particle is far from other targets. Therefore, the relevance of
particles near other targets diminishes. The interaction factor of a particle can thus be defined as a function that increases with
the distance from the nearest target. Since the positions of the targets at time $t$ are not known, the position estimated
by the algorithm in the previous time step is employed. For further information on the role of the interaction factor
the reader is referred to [9,36,48,51]. 
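
One possible shape for such an interaction factor is sketched below. The Gaussian-like form, the function name and the `sigma` parameter are assumptions made here for illustration; the paper only states the limiting behaviour (close to 0 near another target, close to 1 far from all of them).

```python
import math


def interaction_factor(particle_xy, other_targets_xy, sigma=0.5):
    """Hypothetical interaction factor in [0, 1].

    particle_xy:      ground-plane position of the particle.
    other_targets_xy: positions estimated for the other targets in the
                      previous time step.
    sigma:            assumed distance scale (metres) of the penalty.
    """
    if not other_targets_xy:
        return 1.0  # no other targets, hence no penalty
    d_min = min(math.dist(particle_xy, q) for q in other_targets_xy)
    # 0 at zero distance, approaching 1 as the nearest target gets farther.
    return 1.0 - math.exp(-d_min ** 2 / (2.0 * sigma ** 2))


near = interaction_factor((1.0, 1.0), [(1.1, 1.0), (4.0, 4.0)])
far = interaction_factor((1.0, 1.0), [(4.0, 4.0)])
```

Multiplying the pignistic probability by this factor lowers the relevance of particles that drift onto another target, which is the mechanism the text relies on to avoid coalescence.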

The outline of the proposed algorithm is shown in Fig. 1. At the beginning, the algorithm is provided with an initial sample
set $C(0)$ of $N$ particles. The particles in $C(0)$ might be sampled around the initial target position using any suitable distribution.
At each iteration, the algorithm uses the particle set $C(t-1)$ to create a new set $C(t)$ by selecting, with replacement, $N$
particles from $C(t-1)$. For that purpose, the cumulative normalised relevance of the particles is calculated first. Using binary
subdivision, each new particle is then selected by finding the first particle whose cumulative normalised relevance exceeds a
random number $r$. This resampling mechanism allows particles with a high relevance to be selected more
often than particles with a low relevance, which are rapidly discarded from one iteration to the next. As can be observed,
this is the approach employed in the CONDENSATION algorithm [31]. Afterwards, for each selected particle, the algorithm
computes its next state $\mathbf{x}^i_t$ according to a dynamic model of the system. Eq. (8) propagates the state using a transition model
$A$ affected by some noise $w(t-1)$.
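
The resampling and propagation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: states are taken to be 2D ground-plane points, the transition matrix A is taken to be the identity, the noise is isotropic Gaussian with standard deviation `sigma`, and the function name is hypothetical.

```python
import bisect
import itertools
import random


def resample_and_propagate(states, relevances, sigma, rng=random):
    """Select N particles with replacement and propagate them (step (i) of Fig. 1)."""
    total = sum(relevances)
    # Normalised cumulative relevance of the particles.
    cumulative = list(itertools.accumulate(r / total for r in relevances))
    new_states = []
    for _ in states:
        r = rng.random()  # uniform in [0, 1)
        # Binary search for the smallest j whose cumulative relevance exceeds r;
        # the min() guards against rounding in the last cumulative value.
        j = min(bisect.bisect_right(cumulative, r), len(states) - 1)
        x, y = states[j]
        # Eq. (8) with A = identity and zero-mean Gaussian noise.
        new_states.append((x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)))
    return new_states


random.seed(0)
particles = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
new_particles = resample_and_propagate(particles, [0.1, 0.8, 0.1], sigma=0.1)
```

Particles with a large relevance occupy a wide interval of the cumulative sequence and are therefore drawn more often, while particles with zero relevance are never selected.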

Once the new particle set is obtained, all the sensors are employed to calculate the masses $M^i_t$ over the power set. Afterwards, all
the evidence collected for each particle is fused into $\mathcal{M}^i_t$ and the particle relevance $r^i_t$ is computed using the mapping function
$\mathcal{R}$ and an interaction factor $\mathcal{I}^i_t$.

Finally, the algorithm provides the best state estimation $E[C(t)]$ as the state $\mathbf{x}^b_t$ of the particle with the highest relevance. The
main difference with respect to traditional particle filters is that the Bayesian conditions are relaxed, thus allowing non-probabilistic
distributions to be used when estimating the particle evidence. Furthermore, the use of the DS theory allows
the reliability of the sensors to be modelled easily.
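
The relevance computation of Eq. (9) and the final state estimation can be sketched as below. The data layout (one dictionary of fused masses per particle, with keys "present", "not_present" and "unknown"), the function name and the numbers are assumptions for illustration; the fused bbas are assumed to be normalised.

```python
def estimate_state(states, fused_masses, interaction):
    """Return (best_state, relevances) with r = BetP(present) * interaction factor."""
    relevances = []
    for m, factor in zip(fused_masses, interaction):
        # For a two-element frame, Eq. (6) reduces to m(present) plus
        # half of the mass assigned to 'unknown'.
        bet_present = m["present"] + 0.5 * m["unknown"]
        relevances.append(bet_present * factor)
    best = max(range(len(states)), key=relevances.__getitem__)
    return states[best], relevances


# Hypothetical particles, fused masses and interaction factors.
states = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
fused = [
    {"present": 0.2, "not_present": 0.5, "unknown": 0.3},
    {"present": 0.7, "not_present": 0.1, "unknown": 0.2},
    {"present": 0.9, "not_present": 0.0, "unknown": 0.1},
]
interaction = [1.0, 1.0, 0.5]   # the third particle lies close to another target
best_state, relevances = estimate_state(states, fused, interaction)
```

In this example the third particle has the strongest evidence but is penalised by its interaction factor, so the second particle is returned as the best estimate.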


4. MEPF for tracking people in multiple cameras 


This section explains how the MEPF algorithm is employed for tracking people using several cameras. First, we provide a 
brief overview of the algorithm. A detailed explanation of the elements required to implement the proposed algorithm is 
then given. 


4.1. Algorithm overview 


The purpose of our people tracking problem is to estimate the ground plane positions

$$X_t = \{\mathbf{x}^1_t, \ldots, \mathbf{x}^{n_t}_t\}$$

of a set of people in the area of analysis. Let $n_t$ represent the number of people being tracked at time $t$ and $\mathbf{x}^p_t$ a position in the
ground plane. Let us assume that there is a set of $V$ heterogeneous cameras sharing a common reference system obtained by
calibration, thus making it possible to know the projection of a three-dimensional point in each of the cameras. Note
that a fine camera calibration is not required since epipolar lines are not employed in this work. Let us also assume that people
are mostly seen in a standing position and that there is a people detector mechanism which indicates the positions of the
people entering the area under surveillance in an initial time step.


The outline of the proposed algorithm can be seen in Fig. 2. In an initial stage, a background model for each camera is
created. The background modelling technique proposed in Ref. [20] has been employed in this work. Afterwards, the tracking
process starts. Following this stage, the image set is captured and background subtraction is performed. By $F^v$ let us denote
the foreground map obtained for the $v$th camera. A pixel of the foreground map is 1 when it is classified as foreground and 0
otherwise. The trackers then iterate in order to estimate the new locations of the people being tracked.

MEPF Iteration for People Tracking
From the sample set $\mathcal{P}(t-1)$ construct the new set $\mathcal{P}(t)$ as:

(i) Background Subtraction: perform background subtraction and obtain the foreground map $F^v$ for each camera image.
(ii) Resample and propagate the particle sets $C_p(t-1)$.
(iii) Evidence Estimation: Repeat for each camera $v = 1 \ldots V$
    - Reset the occlusion map $O^v$
    - Sort people in increasing order of their distance to the camera
    - For each person $p = 1 \ldots n_t$
        - Calculate the masses $M^v(\mathbf{x}^i_{p,t}) \in M^i_{p,t}$ of the particles.
        - Obtain the particle with the highest relevance $r^{b,v}_{p,t} = \mathcal{R}(M^v(\mathbf{x}^b_{p,t}))$ for this camera.
        - Project the 3D person model in $O^v$ using the position of the particle with the highest relevance.
(iv) Data fusion: Fuse the data of the particles in every particle set $C_p(t-1)$ obtaining $\mathcal{M}^i_{p,t}$. Then, calculate the relevance of particles
taking into account interactions with other targets as:

      $$r^i_{p,t} = \mathcal{R}(\mathcal{M}^i_{p,t}) * \mathcal{I}^i_{p,t} \qquad (13)$$

(v) State Estimation: Obtain the best state estimation $E[C_p(t)]$.
(vi) Colour models and background update: Update the colour models $A_p(t-1)$ of each person using the estimation obtained and
the background models for each camera.

Fig. 2. MEPF for people tracking.


By

$$\mathcal{P}(t) = \{P_p(t) \mid p = 1, \ldots, n_t\},$$

let us denote the information that each tracker keeps about its target $p$. $P_p(t)$ is defined as:

$$P_p(t) = \{C_p(t), E[C_p(t)], A_p(t)\},$$

where $C_p(t)$ represents the particle set (Eq. (10)) and $E[C_p(t)]$ the best estimation obtained from the particle set. Additionally,
each tracker keeps a colour model of the clothes of its target

$$A_p(t) = \{a^v_p \mid v = 1, \ldots, V\},$$

in each of the cameras. Since we consider that the scene might be analysed by cameras with different sensor characteristics
and that illumination is not uniform, a point in the scene might be seen with a different colour in each of the cameras. Therefore,
a different colour model $a^v_p$ is kept for each camera.

Particle propagation is performed using a random walk movement model because of the unpredictable behaviour of peo- 
ple. Therefore, the matrix modelling the dynamics of system A (Eq. (8)) is set to the identity. The random noise applied is 
assumed to follow a Gaussian distribution $N(0, \sigma^2_{(t-1)})$ whose deviation is calculated as:

$$\sigma^2_{(t-1)} = \frac{\sigma^2_m}{r^b_{t-1}}. \qquad (14)$$
When the person is properly located, the relevance of the best particle is $r^b_{t-1} \approx 1$. The deviation is then set to the minimum
value, $\sigma^2_m$, representing the distance walked by a person at normal speed. However, as the relevance decreases (indicating
that the location of the person is not properly known), the noise is increased. Consequently, particles are spread over a wider
search area in an attempt to relocate the target in the next iteration. The value $\sigma^2_m$ is calculated based on the fact that the average
human walking speed is about 1 m/s (3.6 km/h). Then, if the proposed system is able to operate at $fps$ hertz, the parameter
$\sigma^2_m$ is calculated as:

$$\sigma^2_m = \frac{1}{fps}. \qquad (15)$$
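
The adaptive noise described above can be sketched as follows. The exact functional forms are assumptions made for this illustration: the variance is taken to be the minimum value divided by the relevance of the best particle, and the minimum value is taken to be 1/fps (a walking speed of 1 m/s divided by the frame rate). The function name and the `eps` guard are hypothetical.

```python
def walk_variance(best_relevance, fps, eps=1e-6):
    """Assumed form of the adaptive noise: minimum value divided by the relevance.

    best_relevance: relevance of the best particle at the previous step, in (0, 1].
    fps:            frame rate of the tracker in hertz.
    eps:            guard against division by zero when the target is lost.
    """
    minimum = 1.0 / fps  # distance walked at about 1 m/s between two frames
    return minimum / max(best_relevance, eps)


well_located = walk_variance(1.0, fps=10)    # minimum noise
poorly_located = walk_variance(0.25, fps=10)  # four times wider search
```

Under these assumptions a well-located person (relevance close to 1) receives the minimum noise, and the search area widens as the relevance of the best particle drops.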

After propagation, the algorithm proceeds with the particle evaluation. For each particle, each camera evaluates the evidence
of the person at the particle position. Hence, the masses $M^v(\mathbf{x}^i_{p,t}) \in M^i_{p,t}$ are calculated for each camera and particle.
For that purpose, a generic 3D model of a person is rendered in each camera image, assuming that the model is placed at the 
particle position. Then, the number, shape and colour of the foreground points in the projections are analysed. The number 
and shape of the foreground points are evaluated in order to see if the model is projected in an occupied region of the space. 
The colour of the foreground points is employed to create a colour model that is compared against the person colour model 
as. The colour models Ap(t) are initialised from the information of the first frame. Later, they are dynamically updated in 
order to be adapted to illumination changes and body movements. 

One of the most attractive advantages of using multiple cameras is the management of occlusion. When a person is occluded in a camera, another camera might be employed to keep track of that person. The algorithm proposed in this work is specifically designed to deal with occlusions. For that purpose, an occupation map $O^v$ is maintained for each camera. The occupation map has the same dimensions as the original camera image and indicates at each pixel if it is occupied by any of the people being tracked. The occupation map is calculated independently for each camera using a depth-ordered approach. First, targets are sorted according to their distance from the camera. Then, starting from the nearest person to the camera, the evidence of their particles is calculated. The particle with the highest relevance in a camera, $r_p^{i,v}$, computed from the masses of that camera alone, is selected as the best position for that camera. Please notice that $r_p^{i,v}$ does not represent the relevance of the fused masses but the relevance in camera $v$. Therefore, the selected position might not be the best global solution but is a good local solution that allows us to calculate the occlusion independently for each camera. Thus, step (iii) of the algorithm in Fig. 2 can be distributed in as many processes as cameras. The position of the particle with the highest relevance in the camera is employed to project the 3D person model in the occupation map $O^v$ (setting all the points inside the projection to 1). Afterwards, the particles of the next person are evaluated, but this time employing the occupation map to take occlusions into account (this is explained in greater depth in Section 4.4). In brief, particles projecting at image positions already occupied by other people are assigned high values of uncertainty, i.e., high values of $m^v$(unknown). Therefore, in the data fusion step, cameras in which a person is occluded are not as relevant to determining the person’s location as are the cameras in which the person is fully visible.
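The depth-ordered construction of the occupation map can be sketched as follows. This is a simplified illustration rather than the authors' implementation: pixels are stored as sets of (column, row) tuples, `project` is a hypothetical callable standing in for the calibrated projection of the 3D model, and each target is represented only by its best local position for the camera.

```python
import math

def build_occupation_map(targets, camera_pos, project):
    """Paint targets into one camera's occupation map, nearest target first.

    targets: dict mapping a target name to its best local (x, y) position.
    camera_pos: (x, y) ground position of the camera.
    project: hypothetical callable, position -> set of image pixels covered.
    Returns (occupied pixels, processing order, occluded fraction per target).
    """
    order = sorted(targets, key=lambda name: math.dist(targets[name], camera_pos))
    occupied = set()
    occluded_fraction = {}
    for name in order:
        pixels = project(targets[name])
        hidden = pixels & occupied  # pixels already claimed by nearer targets
        occluded_fraction[name] = len(hidden) / len(pixels) if pixels else 0.0
        occupied |= pixels          # set the projection to 1 in the map
    return occupied, order, occluded_fraction
```

Because the map depends only on one camera's view, a function like this could run once per camera in separate processes, which is the distribution the text describes.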

When the particle sets have been evaluated in all the cameras, data are fused in order to obtain a global estimation of people’s locations. For this application we have employed Dempster’s combination rule (Eq. (4)), thus assuming independence between the cameras employed. This holds true while cameras see the target from separated points of view. However, as the number of cameras in the environment increases, so does the correlation between adjacent cameras. In that case, it would be appropriate to develop further fusion strategies considering the cameras’ degree of correlation (using, for instance, the cautious rule [14]).
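As an illustration of the fusion step, the sketch below applies the standard form of Dempster’s rule to the three masses used here, treating unknown as the whole frame {present, ¬present}. The tuple layout (present, not present, unknown) and the function names are assumptions made for this example.

```python
from functools import reduce

def dempster(m1, m2):
    """Combine two mass triples (present, not_present, unknown) with Dempster's rule."""
    p1, n1, u1 = m1
    p2, n2, u2 = m2
    conflict = p1 * n2 + n1 * p2  # mass assigned to contradictory pairs
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    norm = 1.0 - conflict
    present = (p1 * p2 + p1 * u2 + u1 * p2) / norm
    not_present = (n1 * n2 + n1 * u2 + u1 * n2) / norm
    unknown = (u1 * u2) / norm
    return present, not_present, unknown

def fuse(camera_masses):
    """Fuse the masses reported by all cameras for one particle."""
    return reduce(dempster, camera_masses)
```

A camera that reports (0, 0, 1), i.e., complete ignorance, leaves the result of the other cameras unchanged, which is why occluded views contribute little to the fused estimate.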

The relevance of particles at this stage represents a global solution that takes into account both the information from all the sensors and the information about the rest of the targets via the interaction factor $\psi_{p,t}^{i}$. In this work, the interaction factor is defined as a Gaussian function

$$\psi_{p,t}^{i} = 1 - \exp\left(-\frac{dm^2}{2\sigma_{dm}^2}\right), \tag{16}$$

where $dm$ is the Euclidean distance to the nearest target (excluding itself) and $\sigma_{dm}$ the deviation. The deviation is set to $\sigma_{dm} = 0.5$ m, which corresponds to the width of the 3D model employed. The interaction factor tends to 0 for particles drawn near the location of other targets, and tends to 1 for particles far from other targets. Therefore, particles drawn near the location of other targets are considered inappropriate, so they are given a low relevance value, thus avoiding the coalescence problem.
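A minimal sketch of the interaction factor follows. It assumes the usual Gaussian exponent $-dm^2/(2\sigma_{dm}^2)$, 2D ground positions in metres, and a value of 1 when there are no other targets; these choices and the function name are assumptions for illustration.

```python
import math

def interaction_factor(particle, other_targets, sigma_dm=0.5):
    """Eq. (16): close to 0 near another target, close to 1 far from all of them."""
    if not other_targets:
        return 1.0  # assumption: no penalty when the target is alone
    dm = min(math.dist(particle, t) for t in other_targets)
    return 1.0 - math.exp(-dm ** 2 / (2.0 * sigma_dm ** 2))
```

Multiplying a particle's relevance by this factor penalises hypotheses that would place two trackers on the same person.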

Using the fused information, the best location hypothesis $E[C_p(t)]$ is estimated. Afterwards, in step (vi) of the algorithm, the colour models $A_p(t-1)$ are updated using the information from the best hypothesis. The model of each view is updated according to its visibility, i.e., the colour models of views where the person is highly visible are more strongly modified than the colour models of views where the person is partially occluded (see Section 4.5 for further details). Finally, the background models are smoothly adapted to changes in the environment. To prevent people standing for long periods of time from becoming part of the background model, pixels marked as occupied in the occupancy maps $O^v$ are not updated.

In the following sections, we give a detailed explanation of the elements required to implement the proposed algorithm. 
Section 4.2 shows the 3D geometric model employed to model people’s appearance and the information extracted from its 
projection, while the people colour models are explained in Section 4.3. Section 4.4 shows how the masses of the particles 
are calculated. Finally, Section 4.5 explains how the colour models are updated to adapt them to illumination and body pose 
changes. 


4.2. 3D model projection 


The proposed method relies on the use of a geometric 3D model representing the shape of people. We have selected a 
basic model consisting of a box whose dimensions have been selected taking into account the dimensions of an average adult 
person. It has been assumed that the box is 0.5 m in width and 1.8 m in height. Although the model dimensions are fixed in 
this work, they can be adapted to the particular characteristics of the people being observed. Since the cameras are calibrated, it is possible to calculate the projection of the 3D model in a given position $x'$. Let us define by

$$pm(x')^v = \{p_i = (x_i, y_i)\},$$

the image pixels of the $v$th camera image that lie in the projection of a 3D model placed at $x'$. Fig. 3 shows the projection of the model employed in four different cameras. Although in practice a solid model is employed, Fig. 3 shows its wireframe version for viewing purposes.

Note that some of the pixels in $pm(x')^v$ might belong to the background and are therefore not relevant, while others might have already been set as belonging to another person in the occupancy map $O^v$. Then, let us denote by

$$fpm(x')^v = \{p_i \mid F_{p_i} = 1 \wedge p_i \in pm(x')^v\},$$
Fig. 3. Projection of the 3D geometric model employed for tracking people. Cameras are calibrated in order to calculate the model projection in each camera.


the pixels in $pm(x')^v$ that are foreground pixels ($F_{p_i} = 1$). Also, let us define by

$$vpm(x')^v = \{p_i \mid O_{p_i} = 0 \wedge p_i \in fpm(x')^v\},$$

the pixels from $fpm(x')^v$ that have not yet been occupied by other people, i.e., $O_{p_i} = 0$. Finally, let us also define $Vis(x')^v$ as a measure that indicates the visibility of the model projection in the $v$th camera. This measure accounts for the possibility that the model will not project entirely in the camera plane. The measure $Vis(x')^v$ is 1 when the whole model is projected in the camera image. However, it tends to 0 as the model projects outside the camera’s field of view. So, $Vis(x')^v = 0$ means that the particle is not visible from the $v$th camera.
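The pixel sets defined above map directly onto set operations. The sketch below represents pixels as (column, row) tuples; the visibility function is only one possible reading of the measure, taken here as the fraction of projected pixels that fall inside the image, which is an assumption since the text does not give its exact formula.

```python
def foreground_pixels(pm, foreground):
    """fpm: projected pixels that the background model labels as foreground."""
    return {p for p in pm if p in foreground}

def visible_pixels(fpm, occupied):
    """vpm: foreground pixels not yet claimed by another person in the occupancy map."""
    return {p for p in fpm if p not in occupied}

def visibility(pm, width, height):
    """Assumed Vis: fraction of the projection lying inside a width x height image."""
    if not pm:
        return 0.0
    inside = sum(1 for (x, y) in pm if 0 <= x < width and 0 <= y < height)
    return inside / len(pm)
```

With this representation, the cardinalities used by the measures of Section 4.4 are simply `len(pm)`, `len(fpm)` and `len(vpm)`.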


4.3. Person colour model 


A colour histogram $a_p^v$ is maintained at each camera to model the colours of the clothes of the person being tracked. Colour histograms have often been used for modelling colour in tracking problems since they allow the global properties of objects to be captured with invariance to scale, rotation and translation [11]. In this work, histograms are created using the colour of the non-occluded foreground pixels in a model projection, i.e., $vpm(x')^v$. The HSV colour space [23] has been employed because it is relatively invariant to illumination changes. A histogram comprises $n_h n_s$ bins for the hue and saturation. However, as chromatic information is not reliable when the value component is too low or too high, pixels in that situation are not used to describe the chromaticity. Because these “colour-free” pixels might contain important information, histograms are also populated with $n_v$ bins to capture their luminance information. Thus, histograms are composed of $m = n_h n_s + n_v$ bins. Let us define a function $b : \mathbb{R}^3 \rightarrow \{1, \ldots, m\}$ which associates a pixel $p_i$ with the index of the histogram bin $b(p_i) = w$ corresponding to its colour. Then, the $w$th bin of a histogram is calculated as

$$a(w) = \frac{\sum_{p_i \in vpm(x')^v} k[b(p_i) - w]}{|vpm(x')^v|}, \tag{17}$$

where $k$ is the Kronecker delta function and $|\cdot|$ denotes the cardinality. Please notice that the histogram bins are normalised: $\sum_{w=1}^{m} a(w) = 1$.
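The histogram construction can be sketched as below. Pixels are assumed to be (h, s, v) triples scaled to [0, 1], bin indices are 0-based rather than 1-based, and the thresholds `v_low` and `v_high` that decide when a pixel is treated as “colour-free” are hypothetical values, since the text does not state them.

```python
def _bin(value, n_bins):
    """Map a value in [0, 1] to one of n_bins equal-width bins."""
    return min(int(value * n_bins), n_bins - 1)

def bin_index(pixel, n_h=8, n_s=8, n_v=8, v_low=0.2, v_high=0.9):
    """Function b: hue/saturation bin for chromatic pixels, luminance bin otherwise."""
    h, s, v = pixel
    if v_low <= v <= v_high:
        return _bin(h, n_h) * n_s + _bin(s, n_s)
    return n_h * n_s + _bin(v, n_v)

def colour_histogram(pixels, n_h=8, n_s=8, n_v=8):
    """Eq. (17): normalised histogram with n_h * n_s + n_v bins."""
    hist = [0.0] * (n_h * n_s + n_v)
    for p in pixels:
        hist[bin_index(p, n_h, n_s, n_v)] += 1.0
    total = len(pixels)
    return [c / total for c in hist] if total else hist
```

With $n_h = n_s = n_v = 8$ this yields 72 bins, the first 64 holding chromatic pixels and the last 8 holding very dark or very bright ones.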


4.4. Degrees of evidence calculation 

This section explains the basic probability assignment for the masses $M^v(x_{p,t}^i)$ of each particle in each camera. For the sake of clarity, the scripts $i$, $p$, $v$ and $t$ are omitted.

As previously explained, the masses of the power set elements are evaluated for each particle-camera pair: 


$$M(x') = \{m(\text{present}),\ m(\neg\text{present}),\ m(\text{unknown})\}.$$


The mass m(unknown) is the degree to which a sensor cannot provide a solution to the problem. This can be seen as the uncertainty of the sensor or as the inability of the sensor to decide between the two other subsets present and ¬present. The mass m(unknown) is modelled by two components: an occlusion measure and the visibility measure $Vis(x')$.

The occlusion measure, $Oclu(x')$, indicates the portion of points of the model projection that are occluded by other people. It is defined as:

$$Oclu(x') = 1 - \frac{|vpm(x')|}{|fpm(x')| + \epsilon},$$

where $\epsilon$ is a small value to prevent dividing by 0. The mass of the unknown subset is then defined from the non-occluded portion, $1 - Oclu(x')$, and the visibility as:

$$m(\text{unknown}) = 1 - \big((1 - Oclu(x')) \times Vis(x')\big). \tag{18}$$

The value calculated in Eq. (18) tends to 0 for fully visible particles with low occlusion. However, m(unknown) increases as the visibility is reduced or the portion of occlusion becomes greater.
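The behaviour described for m(unknown) can be checked with a short sketch. It assumes that the non-occluded portion, 1 − Oclu, is what multiplies the visibility, so that the mass tends to 0 for a fully visible, non-occluded particle as the text states; the function names and the default `eps` are hypothetical.

```python
def occlusion(n_vpm, n_fpm, eps=1e-6):
    """Oclu: portion of the foreground pixels of the projection hidden by other people."""
    return 1.0 - n_vpm / (n_fpm + eps)

def mass_unknown(oclu, vis):
    """Assumed reading of Eq. (18): 0 when fully visible and unoccluded, 1 otherwise."""
    return 1.0 - (1.0 - oclu) * vis
```

For instance, a projection with half of its foreground pixels already occupied and full visibility yields m(unknown) of about 0.5, and a projection entirely outside the image (Vis = 0) yields 1.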

The masses m(present) and m(¬present) are calculated simultaneously since we consider that the former complements the latter. While m(present) denotes the evidence of a particle being placed at the person’s location, m(¬present) means exactly the opposite. It has been assumed that a particle is likely to be at the person’s location (thus obtaining high values of m(present)) if three conditions are met. Firstly, the particle should be placed at an occupied region of the space (i.e., empty
regions are unlikely to be occupied by people). Secondly, the particle should be projected in the centre of the target instead of 
in its boundaries. Thirdly and finally, the colour distribution of the foreground points in the particle projection should be 
similar to the colour distribution of the target. The overall idea is that a particle is assigned high values of m(present) if it 
projects in a region of the space with sufficient foreground points, they are centred, and their colour distribution is the same 
as the colour model of the person being tracked. These three conditions are evaluated by the three measures Occ(x’), 
Centr(x’) and Cd(x’) explained below. 

The first measure, $Occ(x')$, indicates if the amount of foreground points in the image region where the model projects is appropriate to consider it occupied by a person. For that purpose, let us define

$$bckg(x') = 1 - \frac{|fpm(x')|}{|pm(x')| + \epsilon} \tag{19}$$

as the proportion of background points of the image region where the model projects. The measure $Occ(x')$ is calculated by applying a Butterworth filter to the previous measure. Butterworth filters are defined as:

$$B(f, f_c, n) = \frac{1}{1 + (f/f_c)^{2n}},$$

where parameter $n$ is the order of the filter controlling the smoothness of the curve and parameter $f_c$ is a cutoff value (see Fig. 4). The filter response approaches 1 when $f$ is smaller than $f_c$ and tends to 0 as $f$ becomes greater than $f_c$. The occupancy measure is then defined as:

$$Occ(x') = B(bckg(x'), \theta_{occ}, \gamma_{occ}). \tag{20}$$

Therefore, when the proportion of background points in the projection of the model is below $\theta_{occ}$, $Occ(x')$ is close to 1, indicating that the region is properly occupied. However, as the proportion of background points increases, the value of $Occ(x')$ decreases, thus indicating that the region is empty.
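A minimal sketch of the occupancy measure follows. It assumes the standard low-pass Butterworth response $1/(1 + (f/f_c)^{2n})$ and uses the parameter values selected later in the experiments (cutoff 0.65, order 1) as defaults; the argument names are hypothetical.

```python
def butterworth(f, f_c, n):
    """Low-pass Butterworth response: near 1 below the cutoff f_c, near 0 above it."""
    if f_c <= 0.0:
        raise ValueError("cutoff must be positive")
    return 1.0 / (1.0 + (f / f_c) ** (2 * n))

def occupancy(n_fpm, n_pm, cutoff=0.65, order=1, eps=1e-6):
    """Eq. (20): filter the background proportion of Eq. (19)."""
    bckg = 1.0 - n_fpm / (n_pm + eps)
    return butterworth(bckg, cutoff, order)
```

Note that this form gives a response of exactly 0.5 at the cutoff, and that higher orders make the transition around the cutoff sharper.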

The second measure, Centr(x’), indicates whether the mass centre of the foreground points coincides with the mass cen- 
tre of the projected model. The goal is to assign higher degrees of evidence to particles projected in the centre of the target 
than to particles projected in its boundaries. For that purpose, let us define a function that calculates the centre of mass of a 
point set ps as: 


$$Cent(ps) = \frac{\sum_{p_i \in ps} p_i}{|ps|}. \tag{21}$$

Fig. 4. Response of the Butterworth filter for different configurations.

Then, $Cent(pm(x'))$ denotes the mass centre of the model projection and $Cent(fpm(x'))$ the mass centre of the foreground points. We therefore define the distance


$$dn(x') = \frac{\|Cent(pm(x')) - Cent(fpm(x'))\|}{\sqrt{pm(x')_{height}^2 + pm(x')_{width}^2}} \tag{22}$$

as a normalised distance between the two centres. In Eq. (22), $\|\cdot\|$ denotes the Euclidean distance, while $pm(x')_{height}$ and $pm(x')_{width}$ represent the height and width of the model projection, respectively. Therefore, the denominator represents the maximum possible distance between two points in the rectangle enclosing the model projection. Normalisation is done in order to achieve independence from the distance of the particle to the camera. Finally, $Dn(x')$ is modelled as a Gaussian distribution

$$Dn(x') = \exp\left(-\frac{dn(x')^2}{2\sigma_{dn}^2}\right), \tag{23}$$

with $\sigma_{dn} = 1/3$ so that $Dn(x') \approx 0$ when $dn(x') = 1$.
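The centring measure can be sketched as below. The sketch assumes that the normalising denominator is the diagonal of the bounding rectangle of the projection, that the Gaussian exponent is $-dn^2/(2\sigma_{dn}^2)$, and that a projection without foreground pixels scores 0; these are assumptions made for illustration.

```python
import math

def centroid(points):
    """Eq. (21): mean position of a non-empty collection of (x, y) pixels."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def centring(pm, fpm, sigma_dn=1.0 / 3.0):
    """Gaussian score of the normalised distance between the two mass centres."""
    if not pm or not fpm:
        return 0.0  # assumption: nothing to centre on
    xs = [x for x, _ in pm]
    ys = [y for _, y in pm]
    diagonal = math.hypot(max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)
    dn = math.dist(centroid(pm), centroid(fpm)) / diagonal
    return math.exp(-dn ** 2 / (2.0 * sigma_dn ** 2))
```

A particle whose foreground pixels fill the whole projection scores 1, while one that only grazes the edge of a person scores lower because the foreground centre is displaced from the model centre.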

Finally, measure $Cd(x')$ represents the colour distance between the colour distribution of the pixels in $vpm(x')$ and the colour model of the person (see Section 4.3). A popular approach for measuring the similarity of two distributions is the Bhattacharyya distance [1,33]. The Bhattacharyya distance of two colour histograms $p_1$ and $p_2$ is calculated as:

$$cd(p_1, p_2) = \sqrt{1 - \sum_{w} \sqrt{p_1(w)\, p_2(w)}}. \tag{24}$$

The distance $cd(p_1, p_2)$ is 0 when both colour histograms are identical and tends to 1 as they differ. Using the Bhattacharyya distance we define the measure

$$Cd(x') = cd(a, \hat{a}) \tag{25}$$

indicating the colour distance between the colour histogram of the target ($a$) and the colour histogram of the points in the particle projection $vpm(x')$ (denoted as $\hat{a}$).
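The distance of Eq. (24) is short enough to write out directly. The sketch assumes both histograms are normalised lists of equal length; the clamp to 0 only guards against rounding error.

```python
import math

def bhattacharyya(p1, p2):
    """Eq. (24): 0 for identical normalised histograms, 1 for non-overlapping ones."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p1, p2))
    return math.sqrt(max(0.0, 1.0 - coefficient))
```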

Using the three measures explained above, the mass of the present subset is defined as:

$$m(\text{present}) = (1 - m(\text{unknown})) \times Occ(x') \times Dn(x') \times (1 - Cd(x')). \tag{26}$$

As can be noticed, m(present) has high values when m(unknown) is low, the number of foreground points in the projection of the 3D model is high, they are centred and their colour distribution is similar to the colour distribution of the person being tracked, i.e., the colour distance $Cd(x')$ is small.

Finally, let us define:

$$m(\neg\text{present}) = 1 - (m(\text{unknown}) + m(\text{present})) \tag{27}$$

so that the sum of masses is equal to one.
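Putting the measures together, the basic probability assignment for one particle in one camera can be sketched as follows. Two readings are assumed here so that the masses behave as the text describes: the non-occluded portion (1 − Oclu) multiplies the visibility in m(unknown), and the colour term enters m(present) as a similarity (1 − Cd). The function name and argument order are hypothetical.

```python
def assign_masses(oclu, vis, occ, dn, cd):
    """Return (present, not_present, unknown) for one particle-camera pair."""
    unknown = 1.0 - (1.0 - oclu) * vis                   # Eq. (18), assumed reading
    present = (1.0 - unknown) * occ * dn * (1.0 - cd)    # Eq. (26), colour as similarity
    not_present = 1.0 - (unknown + present)              # Eq. (27)
    return present, not_present, unknown
```

A fully visible, well-centred projection whose colours match the model yields (1, 0, 0), whereas a fully occluded one yields (0, 0, 1) and therefore stays neutral in the fusion step.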


4.5. Colour model update


Changes in illumination conditions and body movements might alter the observed colour distribution of a person’s clothes. It is therefore necessary to continuously update the people’s models $a_p^v$. These are updated using the colour models of the best estimated hypothesis $E[C_p(t)]$ at each iteration. Let us denote by $\hat{a}_p^v$ the colour model in the $v$th view of the best particle evaluated (the one with the highest fused relevance). Then, the bins of the colour histograms are updated as:

$$a_{p,t}^{v}(w) = (1 - \lambda_p^v)\, a_{p,t-1}^{v}(w) + \lambda_p^v\, \hat{a}_p^v(w), \tag{28}$$

where parameter $\lambda_p^v \in [0,1]$ weights the contribution of the observed colour model to the updated one. In this work, this parameter is set to the mass of the present subset in the $v$th view

$$\lambda_p^v = m_p^v(\text{present}). \tag{29}$$

Please notice that $m_p^v(\text{present})$ is not a fused evidence but the evidence calculated before the fusion step. Then, the colour model of each view is updated independently according to its circumstances. Parameter $\lambda_p^v$ is near 1 when the person is highly visible and the colour models $a_p^v$ and $\hat{a}_p^v$ are similar. If the occlusion is high or the colour models are different, $\lambda_p^v$ tends to 0. The goal is to prevent rapid colour model changes that might be caused by occlusions or momentary tracking failures. Therefore, we are assuming that light changes occur smoothly.

Finally, the updated histogram is normalised so that its bins sum up to one.
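The update rule amounts to a per-bin linear blend followed by a renormalisation, as in this minimal sketch. Histograms are assumed to be lists of equal length, and the blending weight is the unfused m(present) of the view, clamped to [0, 1] as a precaution; the function name is hypothetical.

```python
def update_colour_model(model, observed, m_present):
    """Blend the stored histogram with the observed one, then renormalise (Eq. (28))."""
    lam = min(max(m_present, 0.0), 1.0)  # Eq. (29): weight is the view's m(present)
    blended = [(1.0 - lam) * a + lam * o for a, o in zip(model, observed)]
    total = sum(blended)
    return [b / total for b in blended] if total > 0.0 else list(model)
```

With a weight of 0 (occluded view) the stored model is left untouched, and with a weight of 1 it is replaced by the observation.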


5. Experimental results


This section explains the experiments conducted to test the proposed algorithm. Several video sequences have been recorded in two different scenarios. The first scenario is the PEIS room [41], a robotised apartment employed for the development and research of mobile and embedded robotic systems. Four USB webcams were placed at a height of approximately 3 m in slanting positions to cover an area of 3 × 4 m. The cameras were synchronised via software and set to record at 7 fps with a resolution of 320 × 240 pixels. In the second scenario, a total of five FireWire cameras were placed at a height of 3 m to cover an area of approximately 3 × 3 m. The cameras were synchronised via software and set to record at 5 fps with a resolution of 320 × 240 pixels. The number of people in the recorded sequences varied from 2 to 6. The people were instructed to move about freely in the environment. Therefore, interactions and occlusions are frequent in the recorded videos.

The performance of the proposed algorithm depends on a set of parameters that need to be estimated. These parameters are the number of particles $N$, the number of bins of the colour histograms $n_h$, $n_s$, $n_v$ and the occupancy parameters $\theta_{occ}$ and $\gamma_{occ}$ of Eq. (20). In order to determine the values for these parameters, the positions of the people in one of the sequences have been determined in each frame. For that purpose, a camera was mounted on the ceiling of the first scenario and synchronised with the USB cameras. The positions of the people being tracked were then extracted in a total of 2500 frames. In this way, quantitative measures of the tracking error can be obtained. The sequence was recorded in the first scenario and shows three people entering the environment and moving about while discussing a topic. Images of the sequence can be seen in Fig. 8 (explained later).

The best parameter configuration has been estimated in two phases in order to reduce the search space. In the first phase, the sequence was processed for $\theta_{occ} \in \{0, 0.05, \ldots, 1\}$, $\gamma_{occ} \in \{1, 2, 3, 4\}$ and $n_h = n_s = n_v \in \{0, 2, 5, \ldots, 14\}$. The goal was to determine the best values for these parameters. It has been assumed that the quality of the colour acquisition is equal in all the colour channels so that $n_h = n_s = n_v$. In the first phase, the number of particles was set to a large enough value ($N = 300$) in order to prevent tracking failures due to the lack of particles. Because of the stochastic nature of the algorithm, the tests were repeated ten times with different seeds for the random number generator. The error measure employed is the root mean-square error (RMSE) of the manually extracted positions and the positions indicated by the trackers in the ten runs. The results can be seen in Fig. 5.
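For reference, the error measure can be computed as in the sketch below, which assumes that the estimated and reference positions are paired (x, y) ground coordinates in metres and that the samples of all runs are pooled into a single list before calling it.

```python
import math

def rmse(estimated, reference):
    """Root mean-square Euclidean error between paired (x, y) positions."""
    squared = [math.dist(e, r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))
```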

The graph labelled $n_h = n_s = n_v = 0$ represents the case when no colour information is employed. In that case, tracking is based exclusively on position information, i.e., people are tracked by intersecting the foreground information from all the cameras. As can be seen, the algorithm is very sensitive to the occupancy parameters in this case. In general, a value of $\theta_{occ} > 0.25$ is required in order to obtain good results. This means that at least one quarter of the points in the model projection are frequently background points. This occurs for two reasons: because the model is normally larger than a real person and because of errors in the foreground segmentation.

In the cases where $\theta_{occ} = 1$ and $\gamma_{occ} > 1$, background information is not employed, i.e., tracking is based mostly on colour information. For $\gamma_{occ} = 1$, the smooth curve transition means that foreground information is still considered. As $\gamma_{occ}$ increases, the relevance of foreground information becomes null. It can be noticed that as $\theta_{occ}$ increases, a higher number of histogram bins is required in order to obtain good results, i.e., as foreground information becomes less relevant a more precise colour model is required. Nevertheless, the algorithm does not perform well when tracking is based exclusively on colour information. This might be explained by a drift in the colour models due to changes in illumination conditions during tracking. As can be seen, the best tracking performance is normally obtained for intermediate values of $\theta_{occ}$ and high numbers of histogram bins. As regards parameter $\gamma_{occ}$, we have observed that it is preferable to set low values for this parameter to obtain a smooth transition of the Butterworth filter. In light of the results obtained, we consider that good values for the parameters are $n_h = n_s = n_v = 8$, $\theta_{occ} = 0.65$ and $\gamma_{occ} = 1$. Although higher values for the number of bins also produce good results, we consider that 8 bins constitute an appropriate trade-off between performance and computational cost.

In a second phase, the impact of the number of particles on tracking error is analysed. The more particles employed, the higher the computational effort required, but also the higher the precision obtained. However, there must be a limit to the number of particles over which no significant improvement is obtained. Thus, it is desirable to determine the minimum number of particles required for an optimal trade-off between precision and computational cost. For that purpose, the sequence has been processed using the best parameter selection of the previous phase and a different number of particles. Furthermore, the sequence has been processed with an increasing number of cameras in order to determine the impact of the number of views on algorithm performance. The results obtained are shown in the graph of Fig. 6. The horizontal axis of the graph represents the number of particles employed for each tracker. The vertical axis represents the RMSE in determining the people’s position. The RMSE values for the different camera configurations are depicted with different coloured lines. As expected, error is reduced as the number of particles is increased. However, it can be observed that the reduction is greatest between 15 and 30 particles. In fact, in the best camera configuration (four cameras) no error reduction is obtained for more than 30 particles. In that case, the mean error is 0.15 m.

Fig. 5. Tracking errors for different values of the algorithm parameters.

Fig. 6. Error evolution for several camera configurations as the number of particles grows.

Fig. 7. Colour map used to represent the degrees of evidence in Fig. 8.

The tests have been performed on an AMD Turion 3200 portable computer with 1 GB of RAM running Linux. In our tests, foreground extraction and colour conversion consume 10 ms (for each image), while another 10 ms are required for the background update. Evaluation of the masses of 30 particles requires 11 ms for each view, while the final data fusion step requires 2 ms (for the four cameras).

Fig. 8 shows some scenes from the previously analysed sequence. In the sequence, three people enter the room and talk for approximately 3 min. The people move about the room causing frequent occlusions to each other in some of the cameras. The figure shows the tracking results at four different time instants. The odd rows show the camera images at a particular time instant. In the images, the models have been drawn at the position estimated by the tracker. Below the camera images, the figure shows a ground map of the monitored area where the location and orientation of the cameras have been superimposed. The particles have been drawn in the ground maps in the form of circles whose colour indicates their degree of evidence. The colour scheme employed is indicated in Fig. 7. Each target is represented by a different colour: red,¹ green and blue. The particles of the red target are drawn in pure red when m(present) = 1 and m(unknown) = m(¬present) = 0. Black is used for particles with m(unknown) = 1 and m(present) = m(¬present) = 0, and white for particles with m(¬present) = 1 and m(unknown) = m(present) = 0. The remaining intermediate values for the masses of a particle are represented by the colours inside the corresponding triangle. The maps labelled fusion show the evidence resulting from the data fusion step.





1 For interpretation of colour in figures, the reader is referred to the Web version of this article. 
Fig. 8. Tracking results for three frames of a sequence. The upper rows show the camera images superimposing the positions estimated by the proposed tracking algorithm. In the bottom rows the evaluated particles are drawn in a ground map of the scenario. The colour of each particle represents the masses calculated according to the colour scheme shown at the bottom. See text for details.
In the first frame analysed, the red target is occluded by the blue one in the first camera. It can be noticed that the particles for the first camera are drawn in dark grey, thus indicating that the target is occluded in the camera. Nevertheless, since the target is properly seen in the rest of the cameras, the final location estimated by the tracker is very accurate. In the second frame, the blue target is passing between the other two targets. In this situation, the person is occluded simultaneously in at least two of the cameras. Nevertheless, as can be seen in the third frame, tracking can be done properly. Finally, the fourth frame shows another interesting situation: the blue target bends over to type on a keyboard. In this case, the portion of foreground points seen for the person is very low, thus causing the algorithm to assign a low relevance to the evaluated particles. However, the best hypothesis evaluated provides a good estimation of the person’s location as can be seen in the figure.

Fig. 9. Four people move randomly about a lab causing frequent occlusions and leave the field of view of some of the cameras.

Fig. 9 shows one of the test sequences recorded in the second scenario. In this sequence, four people enter the room and start to walk around it randomly. It is a difficult tracking situation since three of the people are wearing clothes with very similar colour distributions. The targets labelled in blue, light blue and red are wearing black and white clothes. Despite this, the algorithm is able to track them without confusing their identities and avoiding the coalescence problem. Another aspect that is worth mentioning about this video is that the people are not seen simultaneously in all the cameras. Certain positions of the environment are only covered by a subset of the cameras. This is especially evident for the first camera, which has a longer focal length. As previously explained, particles drawn in regions unreachable for a camera are assigned high values of m(unknown). Therefore, the rest of the cameras are used to determine the person’s location.

From the experiments performed, we have seen that under-segmentation is the main source of error of the proposed algorithm. Under-segmentation occurs when the target’s colour is similar to the colour of the background. In this case, the target cannot be distinguished from the environment since it produces no foreground points. This situation is shown in the first row of Fig. 10. The figure shows some frames of a sequence where six people are tracked simultaneously. The figure shows both the camera images and the foreground maps immediately below. The foreground pixels are coloured with the colour of the person they belong to. The under-segmentation problem is particularly evident in the first scene (top row) for the person marked in red. He is wearing black clothes whose colour is very similar to the background in the cameras cam(3) and cam(4). The problem with under-segmentation is that the portion of background points becomes very high. Thus, the algorithm might consider that the region is empty. Of course, if the situation is repeated in most of the cameras, the person cannot be tracked. However, if there are more cameras without under-segmentation, the fusion method is able to determine the person’s position.


6. Conclusions 


In this paper we have proposed a novel evidential filtering approach that can be considered an extension of the Bayesian particle filters to the Dempster-Shafer theory of evidence. The proposed algorithm is specifically designed for tracking multiple targets by fusing information from multiple unreliable sensors. The management of uncertainty in our approach is particularly attractive due to its simplicity and because it does not require specifying priors or conditionals that might be difficult to obtain in complex problems. To avoid the curse of dimensionality that arises when joint configurations are employed, a separate tracker is used for each target. Interactions between targets are modelled in order to maintain multi-modality.

Fig. 10. Camera images and foreground extracted for a six-people sequence. Foreground segmentation errors cause problems in the tracking process (see text for details).

The proposed algorithm is employed to provide a novel solution to the multi-camera people tracking problem. Targets are tracked combining foreground, colour and shape information. The proposed evidence particle filter is especially appropriate for modelling the frequent occlusions that occur in the multi-camera tracking problem. For that purpose, an occupancy map is used to detect target occlusions. The occupancy map is computed independently for each camera using a depth-ordered scheme. Therefore, the evidence can be estimated concurrently in each camera. When a particle is placed at a position hidden to a camera (due to occlusion or because the particle is out of the camera’s field of view), the camera indicates that its knowledge about that location is unreliable. Therefore, information from unreliable cameras is given little weight in the final data fusion step. The tests performed show that the proposed algorithm is able to estimate the locations of the people being tracked using a reduced number of particles and under severe occlusion conditions.